403 errors when the Events API is used from a cron job

We have integrated PD using some bash scripts from which we issue curl requests, and these scripts need to be executed as cron jobs. When the script is executed from the shell, the PD alert is successfully generated; however, it returns a 403 status code when it is executed from cron. Do you have a clue what might be going wrong here?

We are using the v2 API (https://v2.developer.pagerduty.com/docs/trigger-events).
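
For reference, the call in each script looks roughly like the sketch below (the routing key and payload values here are placeholders, not our real ones):

    #!/usr/bin/env bash
    # Rough sketch of the kind of call our scripts make; the routing key,
    # summary, and source values are placeholders.
    curl -s -o /dev/null -w "%{http_code}\n" \
      -X POST https://events.pagerduty.com/v2/enqueue \
      -H "Content-Type: application/json" \
      -d '{
            "routing_key": "YOUR_INTEGRATION_KEY",
            "event_action": "trigger",
            "payload": {
              "summary": "Disk usage above threshold",
              "source": "cron-host-01",
              "severity": "critical"
            }
          }'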

Hi Junaid,

This is actually rate limiting, and specific to the v1 Events API: status code 403 is not used in the v2 Events API, and in v1 it signifies that the rate limit has been reached.

I think it would be worth trying the following as an experiment:

  1. Create a near-identical cron job that runs the same thing, but with a different API key pointing to a test service;
  2. Observe whether or not a 403 is encountered.

If you get more 403s, it is possible the cron job itself is sending too many events. Otherwise, there is probably some other process submitting events for the same key used in the original cron job, causing the rate limit to be reached (because the total event volume across all processes that use the key exceeds 60 per minute).
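
To make the experiment concrete, the crontab could look something like this (the script paths and variable name are hypothetical, just to illustrate two jobs running the same logic with different keys):

    # Hypothetical crontab entries for the experiment; the paths and the
    # ROUTING_KEY variable are illustrative, not your actual setup.
    */5 * * * *  /opt/scripts/pd_alert.sh
    */5 * * * *  ROUTING_KEY=TEST_SERVICE_KEY /opt/scripts/pd_alert.sh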

Hopefully that helps!

Hi,

We aren’t generating alerts at anywhere near that frequency. It appears to be a deterministic issue: the alert is not triggered from cron, but it is generated when I run the script manually.

Are there any special character encodings that you think could cause issues here?

Hi,

The first thing I would suspect is that some environment variables differ between the execution environment of the cron job and your shell environment. A really simple way to check this would be to run env > /tmp/env.txt in the script that the cron job executes, and then compare the contents of /tmp/env.txt after the cron job runs with the output of env run from within your shell session.
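
In other words, something along these lines (the file names are just examples):

    # Near the top of the script that the cron job runs:
    env | sort > /tmp/env.cron.txt

    # Then, from your interactive shell session:
    env | sort > /tmp/env.shell.txt
    diff /tmp/env.cron.txt /tmp/env.shell.txt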

Hi Demitri,

We looked at the environment variables, though, and they look fine.

One more experiment we did was to change the cron schedule. It appears that the script works after we modified it, even though we aren’t generating anywhere near 60 alerts per minute. What could the reason be in this case?

@Junaid_Muzammil

You may want to check what other tools are using the given integration key. It could be that the times your task was previously configured to run coincided with times that another task or tool was submitting events and saturating the rate limit. This would explain why changing the schedule made the 403 errors go away. Mind you, the rate limit is enforced not per IP address but per integration key, which is why it is important to keep track of where and how a given key is being used.

Other than that, I can’t yet think of anything else, especially in terms of why simply changing the schedule of a task would stop it from hitting the rate limit. Cron schedules can only be specified down to the minute, so the highest frequency at which a given task can run is once per minute (unless you’re using a specialized or unorthodox implementation of cron).

Moreover, if the task submits roughly the same number of events to the Events API either way, it makes no difference whether it runs once per minute or once per day: it would either hit the rate limit in both cases (because it sends events at a rate higher than 60 per minute, so how often each instance runs is irrelevant) or in neither, assuming the task takes less than one minute to run.
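
One thing that may help narrow this down is logging the HTTP status code and a timestamp on every run, so that any 403s can be correlated with specific minutes of the hour. A rough sketch, assuming the script posts to the v2 enqueue endpoint; the log and payload paths are just examples:

    #!/usr/bin/env bash
    # Sketch: record when each run happens and what status the Events API
    # returned, so 403s can be matched to specific minutes of the hour.
    LOG=/var/log/pd_events.log

    STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
      -X POST https://events.pagerduty.com/v2/enqueue \
      -H "Content-Type: application/json" \
      -d @/tmp/pd_payload.json)

    echo "$(date -u +%FT%TZ) status=${STATUS}" >> "${LOG}"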

If you would like, we can start a private conversation wherein it would be safer for me to share the sensitive details of how the integration key in question is being used vis-à-vis some internal diagnostics on the PagerDuty side of things.

Sure, it would be good to start a private conversation.

Okay, great; I have sent you a private message.

If there are any bits of information pertaining to the solution that aren’t sensitive/privileged info, and that might benefit the community, I’d like to post them here in the thread once we’re done, with your permission.

To follow up here with what we did to find the cause:

We discovered that there were other processes submitting many events for the integration key in question, albeit from separate IP addresses, at exact intervals of a certain number of minutes. This was saturating the integration key at each N-minute mark of the hour.

Therefore, the reason the cron job was hitting the rate limit while manual runs of the script were not is that the manual runs did not line up with the event spikes.

The recommended solution in cases like this is to find out which processes are submitting the highest volume of events and, if it is due to some error, take steps to correct it and reduce the volume. Other viable solutions are staggering scheduled processes that send to PagerDuty through the Events API so that they don’t all run at once, or using separate integration keys for the different processes.
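
For example, if several jobs currently fire on the same minute, staggering them in the crontab keeps them out of the same one-minute rate-limit window (the entries below are purely illustrative):

    # Illustrative only: spread jobs that send events to PagerDuty across
    # different minutes so they don't all hit the same one-minute window.
    0  * * * *  /opt/scripts/job_a.sh
    2  * * * *  /opt/scripts/job_b.sh
    4  * * * *  /opt/scripts/job_c.sh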

Thanks @demitri for your support.